Boosting a decision stump

The goal of this notebook is to implement your own boosting module.

Brace yourselves! This is going to be a fun and challenging assignment.

  • Use SFrames to do some feature engineering.
  • Modify the decision trees to incorporate weights.
  • Implement Adaboost ensembling.
  • Use your implementation of Adaboost to train a boosted decision stump ensemble.
  • Evaluate the effect of boosting (adding more decision stumps) on performance of the model.
  • Explore the robustness of Adaboost to overfitting.

Let's get started!

Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create (1.8.3 or newer). Upgrade by

   pip install graphlab-create --upgrade

See this page for detailed instructions on upgrading.

In [1]:
import graphlab as gl
print('gl.version: %s' % (gl.version))
gl.canvas.set_target('ipynb')
import matplotlib.pyplot as plt
%matplotlib inline

# my imports
import math
import numpy as np
import pandas as pd
from six.moves import cPickle as pickle
import string
gl.version: 1.8.4
This non-commercial license of GraphLab Create is assigned to bbalaji8@gmail.com and will expire on December 09, 2016. For commercial licensing options, visit https://dato.com/buy/.
2016-03-17 09:20:00,264 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.4 started. Logging: /tmp/graphlab_server_1458220797.log

Getting the data ready

We will be using the same LendingClub dataset as in the previous assignment.

In [2]:
#loans = gl.SFrame('lending-club-data.gl/')
glbObsAll = gl.load_sframe('data/module-6-decision-tree-practical-assignment_glbObsAll.gl')
print(glbObsAll.shape)
glbObsAll.show()
print(glbObsAll)
(122607, 68)
+---------+-----------+-----------+-------------+-----------------+------------+
|    id   | member_id | loan_amnt | funded_amnt | funded_amnt_inv |    term    |
+---------+-----------+-----------+-------------+-----------------+------------+
| 1077501 |  1296599  |    5000   |     5000    |       4975      |  36 months |
| 1077430 |  1314167  |    2500   |     2500    |       2500      |  60 months |
| 1077175 |  1313524  |    2400   |     2400    |       2400      |  36 months |
| 1076863 |  1277178  |   10000   |    10000    |      10000      |  36 months |
| 1075269 |  1311441  |    5000   |     5000    |       5000      |  36 months |
| 1072053 |  1288686  |    3000   |     3000    |       3000      |  36 months |
| 1071795 |  1306957  |    5600   |     5600    |       5600      |  60 months |
| 1071570 |  1306721  |    5375   |     5375    |       5350      |  60 months |
| 1070078 |  1305201  |    6500   |     6500    |       6500      |  60 months |
| 1069908 |  1305008  |   12000   |    12000    |      12000      |  36 months |
+---------+-----------+-----------+-------------+-----------------+------------+
+----------+-------------+-------+-----------+-----------------------+------------+
| int_rate | installment | grade | sub_grade |       emp_title       | emp_length |
+----------+-------------+-------+-----------+-----------------------+------------+
|  10.65   |    162.87   |   B   |     B2    |                       | 10+ years  |
|  15.27   |    59.83    |   C   |     C4    |         Ryder         |  < 1 year  |
|  15.96   |    84.33    |   C   |     C5    |                       | 10+ years  |
|  13.49   |    339.31   |   C   |     C1    |  AIR RESOURCES BOARD  | 10+ years  |
|   7.9    |    156.46   |   A   |     A4    |  Veolia Transportaton |  3 years   |
|  18.64   |    109.43   |   E   |     E1    |    MKC Accounting     |  9 years   |
|  21.28   |    152.39   |   F   |     F2    |                       |  4 years   |
|  12.69   |    121.45   |   B   |     B5    |       Starbucks       |  < 1 year  |
|  14.65   |    153.45   |   C   |     C3    | Southwest Rural metro |  5 years   |
|  12.69   |    402.54   |   B   |     B5    |          UCLA         | 10+ years  |
+----------+-------------+-------+-----------+-----------------------+------------+
+----------------+------------+-----------------+-----------------+-------------+
| home_ownership | annual_inc |     is_inc_v    |     issue_d     | loan_status |
+----------------+------------+-----------------+-----------------+-------------+
|      RENT      |   24000    |     Verified    | 20111201T000000 |  Fully Paid |
|      RENT      |   30000    | Source Verified | 20111201T000000 | Charged Off |
|      RENT      |   12252    |   Not Verified  | 20111201T000000 |  Fully Paid |
|      RENT      |   49200    | Source Verified | 20111201T000000 |  Fully Paid |
|      RENT      |   36000    | Source Verified | 20111201T000000 |  Fully Paid |
|      RENT      |   48000    | Source Verified | 20111201T000000 |  Fully Paid |
|      OWN       |   40000    | Source Verified | 20111201T000000 | Charged Off |
|      RENT      |   15000    |     Verified    | 20111201T000000 | Charged Off |
|      OWN       |   72000    |   Not Verified  | 20111201T000000 |  Fully Paid |
|      OWN       |   75000    | Source Verified | 20111201T000000 |  Fully Paid |
+----------------+------------+-----------------+-----------------+-------------+
+------------+-------------------------------+-------------------------------+-----+
| pymnt_plan |              url              |              desc             | ... |
+------------+-------------------------------+-------------------------------+-----+
|     n      | https://www.lendingclub.co... |   Borrower added on 12/22/... | ... |
|     n      | https://www.lendingclub.co... |   Borrower added on 12/22/... | ... |
|     n      | https://www.lendingclub.co... |                               | ... |
|     n      | https://www.lendingclub.co... |   Borrower added on 12/21/... | ... |
|     n      | https://www.lendingclub.co... |                               | ... |
|     n      | https://www.lendingclub.co... |   Borrower added on 12/16/... | ... |
|     n      | https://www.lendingclub.co... |   Borrower added on 12/21/... | ... |
|     n      | https://www.lendingclub.co... |   Borrower added on 12/16/... | ... |
|     n      | https://www.lendingclub.co... |   Borrower added on 12/15/... | ... |
|     n      | https://www.lendingclub.co... |                               | ... |
+------------+-------------------------------+-------------------------------+-----+
[122607 rows x 68 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Extracting the target and the feature columns

We will now repeat some of the feature processing steps that we saw in the previous assignment:

First, we re-assign the target to have +1 as a safe (good) loan, and -1 as a risky (bad) loan.

Next, we select four categorical features:

  1. grade of the loan
  2. the length of the loan term
  3. the home ownership status: own, mortgage, rent
  4. number of years of employment.
In [3]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
# loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
# loans.remove_column('bad_loans')
target = 'safe_loan'
#loans = loans[features + [target]]
glbObsAll, glbObsAll_with_na = glbObsAll[[target] + features].dropna_split()

# Count the number of rows with missing data
num_rows_with_na = glbObsAll_with_na.num_rows()
num_rows = glbObsAll.num_rows()
print 'Dropping %s observations; keeping %s ' % (num_rows_with_na, num_rows)
Dropping 0 observations; keeping 122607 

Subsample dataset to make sure classes are balanced

Just as we did in the previous assignment, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use seed=1 so everyone gets the same results.

In [ ]:
# safe_loans_raw = loans[loans[target] == 1]
# risky_loans_raw = loans[loans[target] == -1]

# # Undersample the safe loans.
# percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
# risky_loans = risky_loans_raw
# safe_loans = safe_loans_raw.sample(percentage, seed=1)
# glbObsSmp = risky_loans_raw.append(safe_loans)

# print "Percentage of safe loans                 :", len(safe_loans) / float(len(glbObsSmp))
# print "Percentage of risky loans                :", len(risky_loans) / float(len(glbObsSmp))
# print "Total number of loans in our new dataset :", len(glbObsSmp)
In [4]:
glbObsSfe = glbObsAll[glbObsAll[target] == +1]
glbObsRsk = glbObsAll[glbObsAll[target] == -1]

# Since there are less risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(glbObsRsk)/float(len(glbObsSfe))
glbObsSfeSmp = glbObsSfe.sample(percentage, seed = 1)
#risky_glbObsAll = glbObsRsk
glbObsSmp = glbObsRsk.append(glbObsSfeSmp)

print "Percentage of safe  loans                : %.2f%%" % \
        (len(glbObsSfeSmp) * 100.0 / float(len(glbObsSmp)))
print "Percentage of risky loans                : %.2f%%" % \
        (len(glbObsRsk)    * 100.0 / float(len(glbObsSmp)))
print "Total number of loans in our new dataset :", len(glbObsSmp)
Percentage of safe  loans                : 50.22%
Percentage of risky loans                : 49.78%
Total number of loans in our new dataset : 46508

Note: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this paper. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

Transform categorical data into binary features

In this assignment, we will work with binary decision trees. Since all of our features are currently categorical features, we want to turn them into binary features using 1-hot encoding.

We can do so with the following code block (see the first assignments for more details):

In [5]:
#glbObsSmp = risky_loans.append(safe_loans)
for feature in features:
    glbObsSmp_one_hot_encoded = glbObsSmp[feature].apply(lambda x: {x: 1})    
    glbObsSmp_unpacked = glbObsSmp_one_hot_encoded.unpack(column_name_prefix=feature)
    
    # Change None's to 0's
    for column in glbObsSmp_unpacked.column_names():
        glbObsSmp_unpacked[column] = glbObsSmp_unpacked[column].fillna(0)

    glbObsSmp.remove_column(feature)
    glbObsSmp.add_columns(glbObsSmp_unpacked)

Let's see what the feature columns look like now:

In [6]:
features = glbObsSmp.column_names()
features.remove(target)  # Remove the response variable
features
Out[6]:
['grade.A',
 'grade.B',
 'grade.C',
 'grade.D',
 'grade.E',
 'grade.F',
 'grade.G',
 'term. 36 months',
 'term. 60 months',
 'home_ownership.MORTGAGE',
 'home_ownership.OTHER',
 'home_ownership.OWN',
 'home_ownership.RENT',
 'emp_length.1 year',
 'emp_length.10+ years',
 'emp_length.2 years',
 'emp_length.3 years',
 'emp_length.4 years',
 'emp_length.5 years',
 'emp_length.6 years',
 'emp_length.7 years',
 'emp_length.8 years',
 'emp_length.9 years',
 'emp_length.< 1 year',
 'emp_length.n/a']

Train-test split

We split the data into training and test sets with 80% of the data in the training set and 20% of the data in the test set. We use seed=1 so that everyone gets the same result.

In [7]:
#glbObsFit, glbObsOOB = glbObsSmp.random_split(0.8, seed=1)
glbObsFit, glbObsOOB = glbObsSmp.random_split(.8, seed=1)
print(glbObsFit.shape)
print(glbObsOOB.shape)
(37224, 26)
(9284, 26)

Weighted decision trees

Let's modify our decision tree code from Module 5 to support weighting of individual data points.

Weighted error definition

Consider a model with $N$ data points with:

  • Predictions $\hat{y}_1 ... \hat{y}_n$
  • Target $y_1 ... y_n$
  • Data point weights $\alpha_1 ... \alpha_n$.

Then the weighted error is defined by: $$ \mathrm{E}(\mathbf{\alpha}, \mathbf{\hat{y}}) = \frac{\sum_{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y_i}]}{\sum_{i=1}^{n} \alpha_i} $$ where $1[y_i \neq \hat{y_i}]$ is an indicator function that is set to $1$ if $y_i \neq \hat{y_i}$.

Write a function to compute weight of mistakes

Write a function that calculates the weight of mistakes for making the "weighted-majority" predictions for a dataset. The function accepts two inputs:

  • nodeLabels: Targets $y_1 ... y_n$
  • obsWeights: Data point weights $\alpha_1 ... \alpha_n$

We are interested in computing the (total) weight of mistakes, i.e. $$ \mathrm{WM}(\mathbf{\alpha}, \mathbf{\hat{y}}) = \sum_{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y_i}]. $$ This quantity is analogous to the number of mistakes, except that each mistake now carries different weight. It is related to the weighted error in the following way: $$ \mathrm{E}(\mathbf{\alpha}, \mathbf{\hat{y}}) = \frac{\mathrm{WM}(\mathbf{\alpha}, \mathbf{\hat{y}})}{\sum_{i=1}^{n} \alpha_i} $$

The function getWgtdMistakesIntermediateNode should first compute two weights:

  • $\mathrm{WM}_{-1}$: weight of mistakes when all predictions are $\hat{y}_i = -1$ i.e $\mathrm{WM}(\mathbf{\alpha}, \mathbf{-1}$)
  • $\mathrm{WM}_{+1}$: weight of mistakes when all predictions are $\hat{y}_i = +1$ i.e $\mbox{WM}(\mathbf{\alpha}, \mathbf{+1}$)

    where $\mathbf{-1}$ and $\mathbf{+1}$ are vectors where all values are -1 and +1 respectively.

After computing $\mathrm{WM}_{-1}$ and $\mathrm{WM}_{+1}$, the function getWgtdMistakesIntermediateNode should return the lower of the two weights of mistakes, along with the class associated with that weight. We have provided a skeleton for you with YOUR CODE HERE to be filled in several places.

In [8]:
def getWgtdMistakesIntermediateNode(nodeLabels, obsWeights):
    # Sum the weights of all entries with label +1
    obsPveWeightsSum = sum(obsWeights[nodeLabels == +1])
    
    # Weight of mistakes for predicting all -1's is equal to the sum above
    ### YOUR CODE HERE
    wgtdMistakesAllNve = obsPveWeightsSum
    
    # Sum the weights of all entries with label -1
    ### YOUR CODE HERE
    obsNveWeightsSum = sum(obsWeights[nodeLabels == -1])
    
    # Weight of mistakes for predicting all +1's is equal to the sum above
    ### YOUR CODE HERE
    wgtdMistakesAllPve = obsNveWeightsSum
    
    # Return the tuple (weight, class_label) representing the lower of the two weights
    #    class_label should be an integer of value +1 or -1.
    # If the two weights are identical, return (wgtdMistakesAllPve, +1)
    ### YOUR CODE HERE
    if (wgtdMistakesAllPve <= wgtdMistakesAllNve):
        return((wgtdMistakesAllPve, +1))
    else:
        return((wgtdMistakesAllNve, -1))

Checkpoint: Test your getWgtdMistakesIntermediateNode function, run the following cell:

In [9]:
example_labels = gl.SArray([-1, -1, 1, 1, 1])
example_obsWeights = gl.SArray([1., 2., .5, 1., 1.])
if getWgtdMistakesIntermediateNode(example_labels, example_obsWeights) == (2.5, -1):
    print 'Test passed!'
else:
    print 'Test failed... try again!'
Test passed!

Recall that the classification error is defined as follows: $$ \mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# all data points}} $$

Quiz Question: If we set the weights $\mathbf{\alpha} = 1$ for all data points, how is the weight of mistakes $\mbox{WM}(\mathbf{\alpha}, \mathbf{\hat{y}})$ related to the classification error?

Function to pick best feature to split on

We continue modifying our decision tree code from the earlier assignment to incorporate weighting of individual data points. The next step is to pick the best feature to split on.

The getFeatureSplitBest function is similar to the one from the earlier assignment with two minor modifications:

  1. The function getFeatureSplitBest should now accept an extra parameter obsWeights to take account of weights of data points.
  2. Instead of computing the number of mistakes in the left and right side of the split, we compute the weight of mistakes for both sides, add up the two weights, and divide it by the total weight of the data.

Complete the following function. Comments starting with DIFFERENT HERE mark the sections where the weighted version differs from the original implementation.

In [11]:
# If the data is identical in each feature, this function should return None

def getFeatureSplitBest(data, features, target, obsWeights):
    
    # These variables will keep track of the best feature and the corresponding error
    featureBest = None
    errorBest = float('+inf') 
    nObs = float(len(data))

    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        # The right split will have all data points where the feature value is 1
        splitLft = data[data[feature] == 0]
        splitRgt = data[data[feature] != 0]
        
        # Apply the same filtering to obsWeights to create obsWeightsLft, obsWeightsRgt
        ## YOUR CODE HERE
        obsWeightsLft = obsWeights[data[feature] == 0]
        obsWeightsRgt = obsWeights[data[feature] != 0]
                    
        # DIFFERENT HERE
        # Calculate the weight of mistakes for left and right sides
        ## YOUR CODE HERE
        wgtdMistakesLft, classLft = \
            getWgtdMistakesIntermediateNode(splitLft[target], obsWeightsLft)
        wgtdMistakesRgt, classRgt = \
            getWgtdMistakesIntermediateNode(splitRgt[target], obsWeightsRgt)
        
        # DIFFERENT HERE
        # Compute weighted error by computing
        #  ( [weight of mistakes (left)] + [weight of mistakes (right)] ) / 
        #     [total weight of all data points]
        ## YOUR CODE HERE
        error = (wgtdMistakesLft + wgtdMistakesRgt) / sum(obsWeights)
        
        # If this is the best error we have found so far, store the feature and the error
        if error < errorBest:
            featureBest = feature
            errorBest = error
    
    # Return the best feature we found
    return featureBest

Checkpoint: Now, we have another checkpoint to make sure you are on the right track.

In [13]:
example_obsWeights = gl.SArray(len(glbObsFit)* [1.5])
if getFeatureSplitBest(glbObsFit, features, target, example_obsWeights) == 'term. 36 months':
    print 'Test passed!'
else:
    print 'Test failed... try again!'
Test passed!

Note. If you get an exception in the line of "the logical filter has different size than the array", try upgradting your GraphLab Create installation to 1.8.3 or newer.

Very Optional. Relationship between weighted error and weight of mistakes

By definition, the weighted error is the weight of mistakes divided by the weight of all data points, so $$ \mathrm{E}(\mathbf{\alpha}, \mathbf{\hat{y}}) = \frac{\sum_{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y_i}]}{\sum_{i=1}^{n} \alpha_i} = \frac{\mathrm{WM}(\mathbf{\alpha}, \mathbf{\hat{y}})}{\sum_{i=1}^{n} \alpha_i}. $$

In the code above, we obtain $\mathrm{E}(\mathbf{\alpha}, \mathbf{\hat{y}})$ from the two weights of mistakes from both sides, $\mathrm{WM}(\mathbf{\alpha}_{\mathrm{left}}, \mathbf{\hat{y}}_{\mathrm{left}})$ and $\mathrm{WM}(\mathbf{\alpha}_{\mathrm{right}}, \mathbf{\hat{y}}_{\mathrm{right}})$. First, notice that the overall weight of mistakes $\mathrm{WM}(\mathbf{\alpha}, \mathbf{\hat{y}})$ can be broken into two weights of mistakes over either side of the split: $$ \mathrm{WM}(\mathbf{\alpha}, \mathbf{\hat{y}}) = \sum{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y_i}] = \sum{\mathrm{left}} \alpha_i \times 1[y_i \neq \hat{y_i}]

  • \sum{\mathrm{right}} \alpha_i \times 1[y_i \neq \hat{y_i}]\ = \mathrm{WM}(\mathbf{\alpha}{\mathrm{left}}, \mathbf{\hat{y}}{\mathrm{left}}) + \mathrm{WM}(\mathbf{\alpha}{\mathrm{right}}, \mathbf{\hat{y}}{\mathrm{right}}) $$ We then divide through by the total weight of all data points to obtain $\mathrm{E}(\mathbf{\alpha}, \mathbf{\hat{y}})$: $$ \mathrm{E}(\mathbf{\alpha}, \mathbf{\hat{y}}) = \frac{\mathrm{WM}(\mathbf{\alpha}{\mathrm{left}}, \mathbf{\hat{y}}{\mathrm{left}}) + \mathrm{WM}(\mathbf{\alpha}{\mathrm{right}}, \mathbf{\hat{y}}{\mathrm{right}})}{\sum{i=1}^{n} \alpha_i} $$

Building the tree

With the above functions implemented correctly, we are now ready to build our decision tree. Recall from the previous assignments that each node in the decision tree is represented as a dictionary which contains the following keys:

{ 
   'isLeaf'           : True/False.
   'prediction'       : Prediction at the leaf node.
   'lft'              : (dictionary corresponding to the left tree).
   'rgt'              : (dictionary corresponding to the right tree).
   'featureRemaining' : List of features that are posible splits.
}

Let us start with a function that creates a leaf node given a set of target values:

In [14]:
def bldLeaf(targetVctr, obsWeights):
    
    # Create a leaf node
    leaf = {'featureSplit' : None,
            'isLeaf'       : True}
    
    # Computed weight of mistakes.
    errorWgtd, classBest = getWgtdMistakesIntermediateNode(targetVctr, obsWeights)
    # Store the predicted class (1 or -1) in leaf['prediction']
    leaf['prediction'] = classBest ## YOUR CODE HERE
    
    return leaf 

We provide a function that learns a weighted decision tree recursively and implements 3 stopping conditions:

  1. All data points in a node are from the same class.
  2. No more features to split on.
  3. Stop growing the tree when the tree depth reaches depthMax.
In [15]:
def bldWgtdDTree(data, features, target, obsWeights, depthCur = 1, depthMax = 10):
    featureRemaining = features[:] # Make a copy of the features.
    targetVctr = data[target]
    print "--------------------------------------------------------------------"
    print "Subtree, depth = %s (%s data points)." % (depthCur, len(targetVctr))
    
    # Stopping condition 1. Error is 0.
    if getWgtdMistakesIntermediateNode(targetVctr, obsWeights)[0] <= 1e-15:
        print "Stopping condition 1 reached."                
        return bldLeaf(targetVctr, obsWeights)
    
    # Stopping condition 2. No more features.
    if featureRemaining == []:
        print "Stopping condition 2 reached."                
        return bldLeaf(targetVctr, obsWeights)    
    
    # Additional stopping condition (limit tree depth)
    if depthCur > depthMax:
        print "Reached maximum depth. Stopping for now."
        return bldLeaf(targetVctr, obsWeights)
    
    # If all the datapoints are the same, featureSplit will be None. Create a leaf
    featureSplit = getFeatureSplitBest(data, features, target, obsWeights)
    featureRemaining.remove(featureSplit)
        
    splitLft = data[data[featureSplit] == 0]
    splitRgt = data[data[featureSplit] != 0]
    
    obsWeightsLft = obsWeights[data[featureSplit] == 0]
    obsWeightsRgt = obsWeights[data[featureSplit] != 0]
    
    print "Split on feature %s. (%s, %s)" % (\
              featureSplit, len(splitLft), len(splitRgt))
    
    # Create a leaf node if the split is "perfect"
    if len(splitLft) == len(data):
        print "Creating leaf node."
        return bldLeaf(splitLft[target], obsWeights)
    if len(splitRgt) == len(data):
        print "Creating leaf node."
        return bldLeaf(splitRgt[target], obsWeights)
    
    # Repeat (recurse) on left and right subtrees
    treeLft = bldWgtdDTree(
        splitLft, featureRemaining, target, obsWeightsLft, depthCur + 1, depthMax)
    treeRgt = bldWgtdDTree(
        splitRgt, featureRemaining, target, obsWeightsRgt, depthCur + 1, depthMax)
    
    return {'isLeaf'       : False, 
            'prediction'   : None,
            'featureSplit' : featureSplit,
            'lft'          : treeLft, 
            'rgt'          : treeRgt}

Here is a recursive function to count the nodes in your tree:

In [16]:
def getNNodes(tree):
    if tree['isLeaf']:
        return 1
    return 1 + getNNodes(tree['lft']) + getNNodes(tree['rgt'])

Run the following test code to check your implementation. Make sure you get 'Test passed' before proceeding.

In [17]:
example_obsWeights = gl.SArray([1.0 for i in range(len(glbObsFit))])
smallDataWgtdDTree = bldWgtdDTree(glbObsFit, features, target,
                                        example_obsWeights, depthMax=2)
if getNNodes(smallDataWgtdDTree) == 7:
    print 'Test passed!'
else:
    print 'Test failed... try again!'
    print 'Number of nodes found:', getNNodes(smallDataWgtdDTree)
    print 'Number of nodes that should be there: 7' 
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature term. 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 2 (9223 data points).
Split on feature grade.A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 3 (9122 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 3 (101 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (28001 data points).
Split on feature grade.D. (23300, 4701)
--------------------------------------------------------------------
Subtree, depth = 3 (23300 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 3 (4701 data points).
Reached maximum depth. Stopping for now.
Test passed!

Let us take a quick look at what the trained tree is like. You should get something that looks like the following

{'isLeaf': False,
    'lft': {'isLeaf': False,
        'lft': {'isLeaf': True, 'prediction': -1, 'featureSplit': None},
        'prediction': None,
        'rgt': {'isLeaf': True, 'prediction': 1, 'featureSplit': None},
        'featureSplit': 'grade.A'
     },
    'prediction': None,
    'rgt': {'isLeaf': False,
        'lft': {'isLeaf': True, 'prediction': 1, 'featureSplit': None},
        'prediction': None,
        'rgt': {'isLeaf': True, 'prediction': -1, 'featureSplit': None},
        'featureSplit': 'grade.D'
     },
     'featureSplit': 'term. 36 months'
}
In [18]:
smallDataWgtdDTree
Out[18]:
{'featureSplit': 'term. 36 months',
 'isLeaf': False,
 'lft': {'featureSplit': 'grade.A',
  'isLeaf': False,
  'lft': {'featureSplit': None, 'isLeaf': True, 'prediction': -1},
  'prediction': None,
  'rgt': {'featureSplit': None, 'isLeaf': True, 'prediction': 1}},
 'prediction': None,
 'rgt': {'featureSplit': 'grade.D',
  'isLeaf': False,
  'lft': {'featureSplit': None, 'isLeaf': True, 'prediction': 1},
  'prediction': None,
  'rgt': {'featureSplit': None, 'isLeaf': True, 'prediction': -1}}}

Making predictions with a weighted decision tree

We give you a function that classifies one data point. It can also return the probability if you want to play around with that as well.

In [19]:
def classifyWgtdDTree(tree, x, annotate = False):   
    # If the node is a leaf node.
    if tree['isLeaf']:
        if annotate: 
            print "At leaf, predicting %s" % tree['prediction']
        return tree['prediction'] 
    else:
        # Split on feature.
        split_feature_value = x[tree['featureSplit']]
        if annotate: 
            print "Split on %s = %s" % (tree['featureSplit'], split_feature_value)
        if split_feature_value == 0:
            return classifyWgtdDTree(tree['lft'], x, annotate)
        else:
            return classifyWgtdDTree(tree['rgt'], x, annotate)

Evaluating the tree

Now, we will write a function to evaluate a decision tree by computing the classification error of the tree on the given dataset.

Again, recall that the classification error is defined as follows: $$ \mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# all data points}} $$

The function called evlClassificationError takes in as input:

  1. tree (as described above)
  2. data (an SFrame)

The function does not change because of adding data point weights.

In [20]:
def evlClassificationError(tree, data):
    # Apply the classifyWgtdDTree(tree, x) to each row in your data
    prediction = data.apply(lambda x: classifyWgtdDTree(tree, x))
    
    # Once you've made the predictions, calculate the classification error
    return (prediction != data[target]).sum() / float(len(data))
In [22]:
evlClassificationError(smallDataWgtdDTree, glbObsOOB)
Out[22]:
0.3981042654028436

Example: Training a weighted decision tree

To build intuition on how weighted data points affect the tree being built, consider the following:

Suppose we only care about making good predictions for the first 10 and last 10 items in glbObsFit, we assign weights:

  • 1 to the last 10 items
  • 1 to the first 10 items
  • and 0 to the rest.

Let us fit a weighted decision tree with depthMax = 2.

In [23]:
# Assign weights
example_obsWeights = gl.SArray([1.] * 10 + [0.]*(len(glbObsFit) - 20) + [1.] * 10)

# Train a weighted decision tree model.
smallDataSubset20WgtdDTree = bldWgtdDTree(glbObsFit, features, target,
                         example_obsWeights, depthMax=2)
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature home_ownership.RENT. (20514, 16710)
--------------------------------------------------------------------
Subtree, depth = 2 (20514 data points).
Split on feature grade.F. (19613, 901)
--------------------------------------------------------------------
Subtree, depth = 3 (19613 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 3 (901 data points).
Stopping condition 1 reached.
--------------------------------------------------------------------
Subtree, depth = 2 (16710 data points).
Split on feature grade.D. (13315, 3395)
--------------------------------------------------------------------
Subtree, depth = 3 (13315 data points).
Stopping condition 1 reached.
--------------------------------------------------------------------
Subtree, depth = 3 (3395 data points).
Stopping condition 1 reached.

Now, we will compute the classification error on the subset20ObsFit, i.e. the subset of data points whose weight is 1 (namely the first and last 10 data points).

In [24]:
subset20ObsFit = glbObsFit.head(10).append(glbObsFit.tail(10))
evlClassificationError(smallDataSubset20WgtdDTree, subset20ObsFit)
Out[24]:
0.05

Now, let us compare the classification error of the model smallDataSubset20WgtdDTree on the entire test set glbObsFit:

In [25]:
evlClassificationError(smallDataSubset20WgtdDTree, glbObsFit)
Out[25]:
0.48124865678057166

The model smallDataSubset20WgtdDTree performs a lot better on subset20ObsFit than on glbObsFit.

So, what does this mean?

  • The points with higher weights are the ones that are more important during the training process of the weighted decision tree.
  • The points with zero weights are basically ignored during training.

Quiz Question: Will you get the same model as smallDataSubset20WgtdDTree if you trained a decision tree with only the 20 data points with non-zero weights from the set of points in subset20ObsFit?

Implementing your own Adaboost (on decision stumps)

Now that we have a weighted decision tree working, it takes only a bit of work to implement Adaboost. For the sake of simplicity, let us stick with decision tree stumps by training trees with depthMax=1.

Recall from the lecture the procedure for Adaboost:

1. Start with unweighted data with $\alpha_j = 1$

2. For t = 1,...T:

  • Learn $f_t(x)$ with data weights $\alpha_j$
  • Compute coefficient $\hat{w}_t$: $$\hat{w}_t = \frac{1}{2}\ln{\left(\frac{1- \mbox{E}(\mathbf{\alpha}, \mathbf{\hat{y}})}{\mbox{E}(\mathbf{\alpha}, \mathbf{\hat{y}})}\right)}$$
  • Re-compute weights $\alpha_j$: $$\alpha_j \gets \begin{cases} \alpha_j \exp{(-\hat{w}_t)} & \text{ if }f_t(x_j) = y_j\\ \alpha_j \exp{(\hat{w}_t)} & \text{ if }f_t(x_j) \neq y_j \end{cases}$$
  • Normalize weights $\alpha_j$: $$\alpha_j \gets \frac{\alpha_j}{\sum_{i=1}^{N}{\alpha_i}} $$

Complete the skeleton for the following code to implement bldAdaboostDTreeStumps. Fill in the places with YOUR CODE HERE.

In [26]:
from math import log
from math import exp

def bldAdaboostDTreeStumps(data, features, target, nDTreeStumps):
    # start with unweighted data
    alpha = gl.SArray([1.]*len(data))
    weights = []
    curDTreeStumps = []
    targetVctr = data[target]
    
    for t in xrange(nDTreeStumps):
        print '====================================================='
        print 'Adaboost Iteration %d' % t
        print '====================================================='        
        # Learn a weighted decision tree stump. Use depthMax=1
        thsDTreeStump = bldWgtdDTree(data, features, target, obsWeights=alpha, depthMax=1)
        curDTreeStumps.append(thsDTreeStump)
        
        # Make predictions
        predictions = data.apply(lambda x: classifyWgtdDTree(thsDTreeStump, x))
        
        # Produce a Boolean array indicating whether
        # each data point was correctly classified
        isAccurate = predictions == targetVctr
        isInAccurt = predictions != targetVctr
        
        # Compute weighted error
        # YOUR CODE HERE
        errorWgtd = sum(alpha[isInAccurt]) / sum(alpha)
        
        # Compute model coefficient using weighted error
        # YOUR CODE HERE
        weight = (1.0 / 2.0) * log((1 - errorWgtd) / errorWgtd)
        weights.append(weight)
        
        # Adjust weights on data point
        adjustment = isAccurate.apply(lambda isAccurate : 
                                      exp(-weight) if isAccurate else exp(weight))
        
        # Scale alpha by multiplying by adjustment 
        # Then normalize data points weights
        ## YOUR CODE HERE 
        alpha = alpha * adjustment
        alpha = alpha / sum(alpha)
    
    return weights, curDTreeStumps

Checking your Adaboost code

Train an ensemble of two tree stumps and see which features those stumps split on. We will run the algorithm with the following parameters:

  • glbObsFit
  • features
  • target
  • nDTreeStumps = 2
In [27]:
wgtStumps, curDTreeStumps = bldAdaboostDTreeStumps(glbObsFit, features, target, nDTreeStumps=2)
=====================================================
Adaboost Iteration 0
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature term. 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 2 (9223 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (28001 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 1
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
In [28]:
def dspDTreeStump(tree):
    split_name = tree['featureSplit'] # split_name is something like 'term. 36 months'
    if split_name is None:
        print "(leaf, label: %s)" % tree['prediction']
        return None
    split_feature, split_value = split_name.split('.')
    print '                       root'
    print '         |---------------|----------------|'
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '  [{0} == 0]{1}[{0} == 1]    '.format(split_name, ' '*(27-len(split_name)))
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '    (%s)                 (%s)' \
        % (('leaf, label: ' + str(tree['lft']['prediction']) if tree['lft']['isLeaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['rgt']['prediction']) if tree['rgt']['isLeaf'] else 'subtree'))

Here is what the first stump looks like:

In [29]:
dspDTreeStump(curDTreeStumps[0])
                       root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [term. 36 months == 0]            [term. 36 months == 1]    
         |                                |
         |                                |
         |                                |
    (leaf, label: -1)                 (leaf, label: 1)

Here is what the next stump looks like:

In [30]:
dspDTreeStump(curDTreeStumps[1])
                       root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [grade.A == 0]                    [grade.A == 1]    
         |                                |
         |                                |
         |                                |
    (leaf, label: -1)                 (leaf, label: 1)
In [31]:
print wgtStumps
[0.15802933659263743, 0.1768236329364191]

If your Adaboost is correctly implemented, the following things should be true:

  • curDTreeStumps[0] should split on term. 36 months with the prediction -1 on the left and +1 on the right.
  • curDTreeStumps[1] should split on grade.A with the prediction -1 on the left and +1 on the right.
  • Weights should be approximately [0.158, 0.177]

Reminders

  • Stump weights ($\mathbf{\hat{w}}$) and data point weights ($\mathbf{\alpha}$) are two different concepts.
  • Stump weights ($\mathbf{\hat{w}}$) tell you how important each stump is while making predictions with the entire boosted ensemble.
  • Data point weights ($\mathbf{\alpha}$) tell you how important each data point is while training a decision stump.

Training a boosted ensemble of 10 stumps

Let us train an ensemble of 10 decision tree stumps with Adaboost. We run the bldAdaboostDTreeStumps function with the following parameters:

  • glbObsFit
  • features
  • target
  • nDTreeStumps = 10
In [32]:
wgtStumps, curDTreeStumps = bldAdaboostDTreeStumps(glbObsFit, features, 
                                target, nDTreeStumps=10)
=====================================================
Adaboost Iteration 0
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature term. 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 2 (9223 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (28001 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 1
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 2
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.D. (30465, 6759)
--------------------------------------------------------------------
Subtree, depth = 2 (30465 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (6759 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 3
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature home_ownership.MORTGAGE. (19846, 17378)
--------------------------------------------------------------------
Subtree, depth = 2 (19846 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (17378 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 4
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.B. (26858, 10366)
--------------------------------------------------------------------
Subtree, depth = 2 (26858 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (10366 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 5
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.E. (33815, 3409)
--------------------------------------------------------------------
Subtree, depth = 2 (33815 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (3409 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 6
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 7
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.F. (35512, 1712)
--------------------------------------------------------------------
Subtree, depth = 2 (35512 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1712 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 8
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 9
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.

Making predictions

Recall from the lecture that in order to make predictions, we use the following formula: $$ \hat{y} = sign\left(\sum_{t=1}^T \hat{w}_t f_t(x)\right) $$

We need to do the following things:

  • Compute the predictions $f_t(x)$ using the $t$-th decision tree
  • Compute $\hat{w}_t f_t(x)$ by multiplying the wgtStumps with the predictions $f_t(x)$ from the decision trees
  • Sum the weighted predictions over each stump in the ensemble.

Complete the following skeleton for making predictions:

In [33]:
def predictAdaboost(wgtStumps, curDTreeStumps, data):
    scores = gl.SArray([0.]*len(data))
    
    for i, thsDTreeStump in enumerate(curDTreeStumps):
        predictions = data.apply(lambda x: classifyWgtdDTree(thsDTreeStump, x))
        
        # Accumulate predictions on scores array
        # YOUR CODE HERE
        scores += wgtStumps[i] * predictions
        
    return scores.apply(lambda score : +1 if score > 0 else -1)
In [34]:
predictions = predictAdaboost(wgtStumps, curDTreeStumps, glbObsOOB)
accuracy = gl.evaluation.accuracy(glbObsOOB[target], predictions)
print 'Accuracy of 10-component ensemble = %s' % accuracy 
Accuracy of 10-component ensemble = 0.620314519604

Now, let us take a quick look what the wgtStumps look like at the end of each iteration of the 10-stump ensemble:

In [35]:
wgtStumps
Out[35]:
[0.15802933659263743,
 0.1768236329364191,
 0.09311888971129693,
 0.07288885525840554,
 0.06706306914118143,
 0.06456916961644447,
 0.05456055779178564,
 0.04351093673362621,
 0.02898871150041245,
 0.02596250969152032]

Quiz Question: Are the weights monotonically decreasing, monotonically increasing, or neither?

Reminder: Stump weights ($\mathbf{\hat{w}}$) tell you how important each stump is while making predictions with the entire boosted ensemble.

Performance plots

In this section, we will try to reproduce some of the performance plots dicussed in the lecture.

How does accuracy change with adding stumps to the ensemble?

We will now train an ensemble with:

  • glbObsFit
  • features
  • target
  • nDTreeStumps = 30

Once we are done with this, we will then do the following:

  • Compute the classification error at the end of each iteration.
  • Plot a curve of classification error vs iteration.

First, lets train the model.

In [36]:
# this may take a while... 
wgtStumps, curDTreeStumps = bldAdaboostDTreeStumps(glbObsFit, 
                                 features, target, nDTreeStumps=30)
=====================================================
Adaboost Iteration 0
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature term. 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 2 (9223 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (28001 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 1
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 2
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.D. (30465, 6759)
--------------------------------------------------------------------
Subtree, depth = 2 (30465 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (6759 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 3
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature home_ownership.MORTGAGE. (19846, 17378)
--------------------------------------------------------------------
Subtree, depth = 2 (19846 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (17378 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 4
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.B. (26858, 10366)
--------------------------------------------------------------------
Subtree, depth = 2 (26858 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (10366 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 5
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.E. (33815, 3409)
--------------------------------------------------------------------
Subtree, depth = 2 (33815 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (3409 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 6
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 7
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.F. (35512, 1712)
--------------------------------------------------------------------
Subtree, depth = 2 (35512 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1712 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 8
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 9
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 10
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.D. (30465, 6759)
--------------------------------------------------------------------
Subtree, depth = 2 (30465 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (6759 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 11
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.B. (26858, 10366)
--------------------------------------------------------------------
Subtree, depth = 2 (26858 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (10366 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 12
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 13
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.4 years. (34593, 2631)
--------------------------------------------------------------------
Subtree, depth = 2 (34593 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (2631 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 14
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 15
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.C. (27812, 9412)
--------------------------------------------------------------------
Subtree, depth = 2 (27812 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (9412 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 16
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 17
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.F. (35512, 1712)
--------------------------------------------------------------------
Subtree, depth = 2 (35512 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1712 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 18
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature term. 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 2 (9223 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (28001 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 19
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.B. (26858, 10366)
--------------------------------------------------------------------
Subtree, depth = 2 (26858 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (10366 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 20
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 21
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.D. (30465, 6759)
--------------------------------------------------------------------
Subtree, depth = 2 (30465 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (6759 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 22
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.F. (35512, 1712)
--------------------------------------------------------------------
Subtree, depth = 2 (35512 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1712 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 23
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.A. (32094, 5130)
--------------------------------------------------------------------
Subtree, depth = 2 (32094 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (5130 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 24
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 25
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.2 years. (33652, 3572)
--------------------------------------------------------------------
Subtree, depth = 2 (33652 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (3572 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 26
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.F. (35512, 1712)
--------------------------------------------------------------------
Subtree, depth = 2 (35512 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1712 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 27
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature home_ownership.OWN. (34149, 3075)
--------------------------------------------------------------------
Subtree, depth = 2 (34149 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (3075 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 28
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature emp_length.n/a. (35781, 1443)
--------------------------------------------------------------------
Subtree, depth = 2 (35781 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (1443 data points).
Reached maximum depth. Stopping for now.
=====================================================
Adaboost Iteration 29
=====================================================
--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature grade.C. (27812, 9412)
--------------------------------------------------------------------
Subtree, depth = 2 (27812 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------------------------
Subtree, depth = 2 (9412 data points).
Reached maximum depth. Stopping for now.

Computing training error at the end of each iteration

Now, we will compute the classification error on the glbObsFit and see how it is reduced as trees are added.

In [37]:
errorIter = []
for n in xrange(1, 31):
    predictions = predictAdaboost(wgtStumps[:n], curDTreeStumps[:n], glbObsFit)
    error = 1.0 - gl.evaluation.accuracy(glbObsFit[target], predictions)
    errorIter.append(error)
    print "Iteration %s, training error = %s" % (n, errorIter[n-1])
Iteration 1, training error = 0.421636578551
Iteration 2, training error = 0.433430045132
Iteration 3, training error = 0.400037610144
Iteration 4, training error = 0.400037610144
Iteration 5, training error = 0.384724908661
Iteration 6, training error = 0.384617451107
Iteration 7, training error = 0.382763808296
Iteration 8, training error = 0.384617451107
Iteration 9, training error = 0.382763808296
Iteration 10, training error = 0.384483129164
Iteration 11, training error = 0.382736943907
Iteration 12, training error = 0.381447453256
Iteration 13, training error = 0.381528046422
Iteration 14, training error = 0.380560928433
Iteration 15, training error = 0.380507199656
Iteration 16, training error = 0.378223726628
Iteration 17, training error = 0.378277455405
Iteration 18, training error = 0.378411777348
Iteration 19, training error = 0.378062540297
Iteration 20, training error = 0.378761014399
Iteration 21, training error = 0.379566946056
Iteration 22, training error = 0.378895336342
Iteration 23, training error = 0.378895336342
Iteration 24, training error = 0.378761014399
Iteration 25, training error = 0.378895336342
Iteration 26, training error = 0.378975929508
Iteration 27, training error = 0.379110251451
Iteration 28, training error = 0.378922200731
Iteration 29, training error = 0.379029658285
Iteration 30, training error = 0.378734150011

Visualizing training error vs number of iterations

We have provided you with a simple code snippet that plots classification error with the number of iterations.

In [38]:
plt.rcParams['figure.figsize'] = 7, 5
plt.plot(range(1,31), errorIter, '-', linewidth=4.0, label='Training error')
plt.title('Performance of Adaboost ensemble')
plt.xlabel('# of iterations')
plt.ylabel('Classification error')
plt.legend(loc='best', prop={'size':15})

plt.rcParams.update({'font.size': 16})

Quiz Question: Which of the following best describes a general trend in accuracy as we add more and more components? Answer based on the 30 components learned so far.

  1. Training error goes down monotonically, i.e. the training error reduces with each iteration but never increases.
  2. Training error goes down in general, with some ups and downs in the middle.
  3. Training error goes up in general, with some ups and downs in the middle.
  4. Training error goes down in the beginning, achieves the best error, and then goes up sharply.
  5. None of the above

Evaluation on the test data

Performing well on the training data is cheating, so lets make sure it works on the glbObsOOB as well. Here, we will compute the classification error on the glbObsOOB at the end of each iteration.

In [39]:
errorIterOOB = []
for n in xrange(1, 31):
    predictions = predictAdaboost(wgtStumps[:n], curDTreeStumps[:n], glbObsOOB)
    error = 1.0 - gl.evaluation.accuracy(glbObsOOB[target], predictions)
    errorIterOOB.append(error)
    print "Iteration %s, test error = %s" % (n, errorIterOOB[n-1])
Iteration 1, test error = 0.42330891857
Iteration 2, test error = 0.428479103835
Iteration 3, test error = 0.398104265403
Iteration 4, test error = 0.398104265403
Iteration 5, test error = 0.379900904782
Iteration 6, test error = 0.380008616975
Iteration 7, test error = 0.379254631624
Iteration 8, test error = 0.380008616975
Iteration 9, test error = 0.379254631624
Iteration 10, test error = 0.379685480396
Iteration 11, test error = 0.379254631624
Iteration 12, test error = 0.377962085308
Iteration 13, test error = 0.379254631624
Iteration 14, test error = 0.377854373115
Iteration 15, test error = 0.378500646273
Iteration 16, test error = 0.377854373115
Iteration 17, test error = 0.377962085308
Iteration 18, test error = 0.377854373115
Iteration 19, test error = 0.378177509694
Iteration 20, test error = 0.376884963378
Iteration 21, test error = 0.377531236536
Iteration 22, test error = 0.376777251185
Iteration 23, test error = 0.376777251185
Iteration 24, test error = 0.376884963378
Iteration 25, test error = 0.376777251185
Iteration 26, test error = 0.376561826799
Iteration 27, test error = 0.376454114606
Iteration 28, test error = 0.376992675571
Iteration 29, test error = 0.376777251185
Iteration 30, test error = 0.376777251185

Visualize both the training and test errors

Now, let us plot the training & test error with the number of iterations.

In [40]:
plt.rcParams['figure.figsize'] = 7, 5
plt.plot(range(1,31), errorIter, '-', linewidth=4.0, label='Training error')
plt.plot(range(1,31), errorIterOOB, '-', linewidth=4.0, label='Test error')

plt.title('Performance of Adaboost ensemble')
plt.xlabel('# of iterations')
plt.ylabel('Classification error')
plt.rcParams.update({'font.size': 16})
plt.legend(loc='best', prop={'size':15})
plt.tight_layout()

Quiz Question: From this plot (with 30 trees), is there massive overfitting as the # of iterations increases?

In [ ]: